Spring 2024
How do we identify the effect of a treatment (cause) on an outcome?
Assuming successful randomization to treatment and control, you know it’s the treatment that’s causing the effect.
\(T\) a binary treatment variable
\(Y\) the value of the outcome we observe
\(Y^0\) the value the outcome would take if \(T=0\)
\(Y^1\) the value the outcome would take if \(T=1\)
Let’s think about the last two a bit more carefully…
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 2 | 3 | NA | NA |
| Barb | 3 | 4 | NA | NA |
| Catherine | 3 | 4 | NA | NA |
| David | 2 | 3 | NA | NA |
What do these numbers mean?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 3 | 1 | 3 |
| Barb | 3 | NA | 0 | 3 |
| Catherine | NA | 4 | 1 | 4 |
| David | 2 | NA | 0 | 2 |
\[ Y = TY^1+(1-T)Y^0 \]
We really care about the difference between \(Y^0\) and \(Y^1\). (Why?)
Let \(\delta_i = y^1_i - y^0_i\)
\(E[\delta]=E[Y^1-Y^0]\)
\(E[\delta]=E[Y^1]-E[Y^0]\)
This is the definition of a treatment effect.
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 2 | 3 | NA | NA |
| Barb | 3 | 4 | NA | NA |
| Catherine | 3 | 4 | NA | NA |
| David | 2 | 3 | NA | NA |
If we could see this (invisible) world, what would be our calculation of the treatment effect?
Why can’t we make these calculations in real life?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 3 | 1 | 3 |
| Barb | 3 | NA | 0 | 3 |
| Catherine | NA | 4 | 1 | 4 |
| David | 2 | NA | 0 | 2 |
Does this give us the right answer? Why?
\(T \bot Y^0\)
\(T \bot Y^1\)
\(E[Y^0 | T = 0] = E[Y^0 | T = 1 ]\)
\(E[Y^1 | T = 0] = E[Y^1 | T = 1 ]\)
In a properly executed experiment, there is no association between the potential outcome variables and treatment assignment.
\(E[Y^0 | T = 0] \simeq E[Y^0]\)
\(E[Y^1 | T = 1] \simeq E[Y^1]\)
So…
\(E[\delta] = E[Y|T=1]-E[Y|T=0]\)
The difference between the treatment average and the control average
\(E[\delta]\) is the expected value (mean) of the difference between each unit’s value of \(Y^1\) and \(Y^0\). It is the average treatment effect (ATE). In a sample, this is the sample average treatment effect (SATE).
Even though the individual differences are unobservable (because either \(Y^0\) or \(Y^1\) will be counterfactual for each unit), we can estimate the mean difference via experiment.
\[\text{SATE} = \frac{1}{n}\sum_{i=1}^{n}(y^1_i - y^0_i)\]
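With the god's-eye view from the table above, the SATE is just the mean of the unit-level differences. A minimal sketch in R, using the hypothetical values for Andrew, Barb, Catherine, and David:

```r
# Potential outcomes from the table above (the unobservable "god's-eye" view)
y0 <- c(2, 3, 3, 2)  # Andrew, Barb, Catherine, David
y1 <- c(3, 4, 4, 3)

delta <- y1 - y0      # unit-level treatment effects
sate  <- mean(delta)  # sample average treatment effect
sate                  # 1
```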
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 4000 | 4000 | NA | NA |
| Barb | 2000 | 2000 | NA | NA |
| Catherine | 3000 | 3000 | NA | NA |
| David | 3000 | 3000 | NA | NA |
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 3000 | 1 | 3000 |
| David | 3000 | NA | 0 | 3000 |
Here, the people who go to college have different baseline earnings that have nothing to do with going to college.
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 3000 | 1 | 3000 |
| David | 3000 | NA | 0 | 3000 |
\[\hat{\delta}_{naive} = E[Y | T = 1] - E[Y | T = 0]\]
\[\hat{\delta}_{naive} = 3500 - 2500 = 1000\]
Is this a good estimate of the SATE? Why or why not?
The treatment may not have a single effect, but may have different effects for different groups in the population. If the treatment and control groups (would) respond differently to treatment, this can bias the estimate of the effect.
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 2000 | 4000 | NA | NA |
| Barb | 2000 | 2000 | NA | NA |
| Catherine | 2000 | 4000 | NA | NA |
| David | 2000 | 2000 | NA | NA |
What is the effect of a college degree here?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 4000 | 1 | 4000 |
| David | 2000 | NA | 0 | 2000 |
How might we calculate the effect of a college degree here?
Does this give the right answer? Why or why not?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 3000 | 4000 | NA | NA |
| Barb | 2000 | 3500 | NA | NA |
| Catherine | 3000 | 4000 | NA | NA |
| David | 2000 | 3500 | NA | NA |
What is the effect of a college degree here?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 4000 | 1 | 4000 |
| David | 2000 | NA | 0 | 2000 |
How might we calculate the effect of a college degree on earnings here?
Does this give the right answer? Why or why not?
ATE is \(E(Y^1 - Y^0)\) for all units (effect of switching)
ATT is \(E(Y^1 - Y^0)\) for treated units (effect of taking away treatment)
ATC is \(E(Y^1 - Y^0)\) for untreated units (effect of adding treatment)
| Group | \(E(Y^1)\) | \(E(Y^0)\) |
|---|---|---|
| College degree | 1000 | 600 |
| No degree | 800 | 500 |
If 30% of the population has a degree…
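Working the exercise out from the table (treating the column values as group-specific potential-outcome means), the ATE is the population-weighted average of the ATT and the ATC:

```latex
\begin{aligned}
\text{ATT} &= E(Y^1 - Y^0 \mid T = 1) = 1000 - 600 = 400\\
\text{ATC} &= E(Y^1 - Y^0 \mid T = 0) = 800 - 500 = 300\\
\text{ATE} &= .3 \times 400 + .7 \times 300 = 330
\end{aligned}
```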
Lundberg, I., Johnson, R., & Stewart, B. M. (2021). What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory. American Sociological Review, 86(3), 532–565.
Y: outcome
T: treatment
U: unobserved confounder
S: affects selection into T
X: affects Y directly
Regression attempts to identify \(T \rightarrow Y\) by adjusting for \(X\) while regressing \(Y\) on \(T\)
Matching and weighting attempt to identify \(T \rightarrow Y\) by ensuring that \(S\) has the same distribution for all values of \(T\)
Both are strategies to close the backdoor path between \(T\) and \(Y\)
These techniques only allow us to account for observed differences between treated and control cases. If \(U\) is unobserved, we can’t close the backdoor path with either of these approaches.
library(tidyverse)  # for tibble() and the pipe

set.seed(1234)
# create the data
obs <- 1e6 # 1M observations to minimize sampling noise
d <- tibble(
U = rbinom(obs, 1, .5) , # unobserved difference
S = rbinom(obs, 1, .25 + .5*U) , # U -> S
T = rbinom(obs, 1, .25 + .5*S) , # S -> T
Y0 = round(2000 + 2000*U + rnorm(obs, 0, 500)) , # untreated outcome
Y1 = round(3000 + 2000*U + rnorm(obs, 0, 500)) , # treated outcome
Y = T*Y1 + (1-T)*Y0) |> # observed Y
select(-U, -(Y0:Y1)) # keep only observed variables

| S | T | mean_y |
|---|---|---|
| 0 | 0 | 2499 |
| 0 | 1 | 3498 |
| 1 | 0 | 3499 |
| 1 | 1 | 4500 |
What is the difference within levels of S? Does this give us the right answer? Why?
In an experiment, the treatment and control groups are otherwise the same.
Assumption 1: \(E(Y^1|T=1) = E(Y^1|T=0)\)
Assumption 2: \(E(Y^0|T=1) = E(Y^0|T=0)\)
There exist some observable variables, \(S\), which completely account for the differences between the treatment and control groups.
Assumption 1-S: \(E(Y^1|T=1,S) = E(Y^1|T=0,S)\)
Assumption 2-S: \(E(Y^0|T=1,S) = E(Y^0|T=0,S)\)
If we control for A and B here, we’re not unconfounding the relationship between T and Y. Rather, we’re controlling away part of the true effect of T! This is why we shouldn’t control for (or stratify on) post-treatment variables.
| SES | n | degree | earnings |
|---|---|---|---|
| 1 | 150 | 0 | 2000 |
| 1 | 50 | 1 | 4000 |
| 2 | 100 | 0 | 6000 |
| 2 | 100 | 1 | 8000 |
| 3 | 50 | 0 | 10000 |
| 3 | 150 | 1 | 14000 |
What are the ATE and ATT?
SATE = $2667
SATT = $3000
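These numbers can be reproduced by weighting the stratum-specific effects: by stratum size for the SATE, and by the number of treated cases for the SATT. A quick check in base R:

```r
# Stratum-level counts and effects from the SES table above
n_c    <- c(150, 100,  50)                 # control cases in SES 1, 2, 3
n_t    <- c( 50, 100, 150)                 # treated cases
effect <- c(4000 - 2000, 8000 - 6000, 14000 - 10000)  # within-stratum effects

sate <- sum((n_c + n_t) * effect) / sum(n_c + n_t)  # weight by all units
satt <- sum(n_t * effect) / sum(n_t)                # weight by treated units
round(c(sate, satt))  # 2667 3000
```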
| SES | n | degree | earnings |
|---|---|---|---|
| 1 | 150 | 0 | 2000 |
| 1 | 0 | 1 | NA |
| 2 | 100 | 0 | 6000 |
| 2 | 100 | 1 | 8000 |
| 3 | 50 | 0 | 10000 |
| 3 | 150 | 1 | 14000 |
What is the ATE here? ATT?
Strata that have only treatment or control cases (not both) are called off support. Strata with both treatment and control cases are in the region of common support.
What is the effect of maternal smoking on infant health?
Data are from a subsample (N = 4642) of singleton births in Pennsylvania between 1989 and 1991. See Almond et al. 2005. “The Costs of Low Birth Weight.”
d <- haven::read_dta(here("data", "cattaneo2.dta"))
d <- d |>
haven::zap_labels() |> # remove Stata labels
mutate( smoker = factor(mbsmoke,
labels = c("Nonsmoker", "Smoker")) ,
zweight = (bweight - mean(bweight)) / sd(bweight)) |>
select( bweight, zweight, lbweight, mbsmoke, smoker, mmarried,
mage, medu, fbaby, alcohol, mrace, nprenatal )

vars n mean sd min max range se
bweight 1 4642 3361.68 578.82 340.00 5500.00 5160.00 8.50
zweight 2 4642 0.00 1.00 -5.22 3.69 8.91 0.01
lbweight 3 4642 0.06 0.24 0.00 1.00 1.00 0.00
mbsmoke 4 4642 0.19 0.39 0.00 1.00 1.00 0.01
mmarried 5 4642 0.70 0.46 0.00 1.00 1.00 0.01
mage 6 4642 26.50 5.62 13.00 45.00 32.00 0.08
medu 7 4642 12.69 2.52 0.00 17.00 17.00 0.04
fbaby 8 4642 0.44 0.50 0.00 1.00 1.00 0.01
alcohol 9 4642 0.03 0.18 0.00 1.00 1.00 0.00
mrace 10 4642 0.84 0.37 0.00 1.00 1.00 0.01
nprenatal 11 4642 10.76 3.68 0.00 40.00 40.00 0.05
| estimate | conf.low | conf.high |
|---|---|---|
| 0.476 | 0.404 | 0.547 |
The t-test gives us the naive estimate of the effect of smoking. If we assume the only difference between smokers and non-smokers is smoking, the effect of smoking is to reduce birthweight by .476 standard deviations (276g).
Note: later we will define formulas so we don’t have to re-type all the variables every time.
A matchit object
- method: Exact matching
- number of obs.: 4642 (original), 4642 (matched)
- target estimand: ATT
- covariates: mmarried, alcohol, mrace
Control Treated
All (ESS) 3778.000 864
All 3778.000 864
Matched (ESS) 2081.684 864
Matched 3778.000 864
Unmatched 0.000 0
Discarded 0.000 0
Rows: 4,642
Columns: 14
$ bweight <dbl> 3459, 3260, 3572, 2948, 2410, 3147, 3799, 3629, 2835, 3880, …
$ zweight <dbl> 0.16813549, -0.17566764, 0.36336038, -0.71469567, -1.6441734…
$ lbweight <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ mbsmoke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ smoker <fct> Nonsmoker, Nonsmoker, Nonsmoker, Nonsmoker, Nonsmoker, Nonsm…
$ mmarried <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ mage <dbl> 24, 20, 22, 26, 20, 27, 27, 24, 21, 30, 26, 20, 34, 21, 23, …
$ medu <dbl> 14, 10, 9, 12, 12, 12, 12, 12, 12, 15, 12, 12, 14, 8, 12, 12…
$ fbaby <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, …
$ alcohol <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mrace <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ nprenatal <dbl> 10, 6, 10, 10, 12, 9, 16, 11, 20, 9, 14, 5, 13, 8, 4, 10, 13…
$ weights <dbl> 0.6057834, 1.3247009, 0.6057834, 0.6057834, 0.6057834, 2.360…
$ subclass <fct> 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, …
subclass_table <- ematch_data %>%
group_by(mmarried, alcohol, mrace) %>%
summarize(
n_t = sum(mbsmoke), # Ntreat
n_c = sum(1-mbsmoke), # Ncon
zbw_t = weighted.mean(zweight, w = mbsmoke), # mean std bw for treated
zbw_c = weighted.mean(zweight, w = 1-mbsmoke), # mean std bw for control
row_diff = zbw_t - zbw_c, # mean treat-control diff
wt_t = weighted.mean( weights, w = mbsmoke), # mean treat weight
wt_c = weighted.mean( weights, w = 1-mbsmoke)) # mean control weight

# A tibble: 8 × 10
# Groups: mmarried, alcohol [4]
mmarried alcohol mrace n_t n_c zbw_t zbw_c row_diff wt_t wt_c
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 113 373 -0.592 -0.493 -0.0996 1 1.32
2 0 0 1 291 539 -0.391 0.0506 -0.442 1 2.36
3 0 1 0 29 14 -0.970 -0.253 -0.717 1 9.06
4 0 1 1 22 13 -0.379 -0.514 0.135 1 7.40
5 1 0 0 19 182 -0.775 -0.262 -0.512 1 0.456
6 1 0 1 362 2613 -0.252 0.205 -0.457 1 0.606
7 1 1 0 4 6 -0.824 -0.0936 -0.731 1 2.92
8 1 1 1 24 38 -0.328 0.383 -0.711 1 2.76
The jury is still out on the theory behind some of these procedures, so in many cases it’s not obvious what the right way is to calculate standard errors.
We will cover bootstrapping later; it is now (once again) seen as a good option for getting standard errors.
See Iacus, King, and Porro. 2019. “A Theory of Statistical Inference for Matching Methods in Causal Research.” Political Analysis.
Matching: compare treatment/control differences in outcome on cases that are identical on other observed characteristics (i.e., matched)
Weighting: apply weights so that these stratum-specific differences are aggregated in a way that reflects the distribution of interest (e.g., ATT)
ematch_out2 <- matchit(mbsmoke ~ mmarried + alcohol + mrace +
fbaby + mage + medu + nprenatal ,
data = d,
method = "exact")
ematch_out2

A matchit object
- method: Exact matching
- number of obs.: 4642 (original), 1235 (matched)
- target estimand: ATT
- covariates: mmarried, alcohol, mrace, fbaby, mage, medu, nprenatal
Control Treated
All (ESS) 3778.0000 864
All 3778.0000 864
Matched (ESS) 478.5805 362
Matched 873.0000 362
Unmatched 2905.0000 502
Discarded 0.0000 0
But we don’t want a method that requires us to throw away tons of data if we don’t have to!
Propensity scores are a solution to the problem of sparseness. If we can’t find exact matches for each case because of high dimensional (or fine-grained) data, can we reduce the complexity of the data and find “good enough” matches?
| SES | n | degree |
|---|---|---|
| 1 | 150 | 0 |
| 1 | 50 | 1 |
| 2 | 100 | 0 |
| 2 | 100 | 1 |
| 3 | 50 | 0 |
| 3 | 150 | 1 |
A propensity score is a subject’s probability of receiving treatment, given their observed characteristics.
What are the propensity scores for subjects in different levels of S?
Exact matches are also propensity score matches!
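From the counts in the table above, the propensity score within each SES stratum is just the share of cases that are treated:

```r
# P(T = 1 | SES) from the table above
n_control <- c(150, 100,  50)  # SES 1, 2, 3
n_treated <- c( 50, 100, 150)

pscore <- n_treated / (n_treated + n_control)
pscore  # 0.25 0.50 0.75
```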
Because exact matching is impossible when S comprises many variables, propensity scores allow us to summarize S in a single, continuous variable. This allows comparing “apples to apples” as long as we are comfortable with “appleness” as defined by the propensity score.
How do I estimate p-scores?
logistic regression
probit regression
machine learning (e.g., SVM)
any other classifier…
What do I do with them?
stratification (“interval matching”)
matching
weighting
I will first demonstrate the traditional workflow
Then I will present some newer approaches that aim to rectify the weaknesses of the traditional approach
psmod <- glm( mbsmoke ~ mmarried + alcohol + mrace + fbaby +
mage + I(mage^2) + medu + nprenatal ,
data = d ,
family = binomial )
tidy( psmod )

# A tibble: 9 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.04 0.823 -3.70 2.19e- 4
2 mmarried -1.24 0.0999 -12.4 4.05e-35
3 alcohol 1.57 0.185 8.47 2.50e-17
4 mrace 0.667 0.118 5.63 1.78e- 8
5 fbaby -0.405 0.0907 -4.47 7.83e- 6
6 mage 0.312 0.0645 4.83 1.34e- 6
7 I(mage^2) -0.00591 0.00119 -4.95 7.35e- 7
8 medu -0.141 0.0176 -8.02 1.09e-15
9 nprenatal -0.0299 0.0111 -2.70 6.94e- 3
All the same issues that apply to logistic regression in ordinary circumstances apply here (e.g., goodness of fit, functional form).
If the model is wrong, the propensity scores may also be wrong!
The main assessment procedure in the traditional workflow is checking for covariate balance.
# A tibble: 8 × 9
smoker mmarried mage medu fbaby alcohol mrace nprenatal ps1
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Nonsmoker 1 36 16 0 0 0 8 0.0386
2 Smoker 1 36 17 1 0 1 12 0.0386
3 Nonsmoker 1 28 16 1 0 0 12 0.0388
4 Smoker 1 31 15 1 0 0 13 0.0388
5 Nonsmoker 1 34 17 1 0 1 16 0.0418
6 Smoker 1 43 12 0 0 1 10 0.0419
7 Nonsmoker 1 38 13 0 0 0 10 0.0430
8 Smoker 1 32 14 1 0 0 12 0.0432
We are going to focus on weighting only.
| Sex | P(population) | P(sample) | formula | weight |
|---|---|---|---|---|
| Male | .50 | .40 | .50/.40 | 1.25 |
| Female | .50 | .60 | .50/.60 | .80 |
In surveys, we give more weight to groups that don’t show up as much as they should in the sample. The logic with IPTW is the same – in an experiment the treatment and control groups “should” be balanced in their propensity to receive treatment and on any other factors as well.
\[w_i = T_i + (1-T_i)\frac{P(T=1|\mathbf{X}_i)}{1-P(T=1|\mathbf{X}_i)}\]
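A minimal sketch of how this formula turns propensity scores into ATT weights (the `ps` values here are hypothetical; in practice they would come from a fitted model like `psmod`):

```r
# IPTW weights targeting the ATT: treated units get weight 1;
# control units get the odds of treatment given their covariates
ps <- c(.2, .4, .6)  # hypothetical fitted propensity scores
tr <- c( 1,  0,  0)  # treatment indicators

w <- tr + (1 - tr) * ps / (1 - ps)
round(w, 2)  # 1.00 0.67 1.50
```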
Our code will be easier to write and read if we pre-define formulas. We want (at least) two versions for different uses:
Creating weights that balance the treatment and control groups on the propensity score is relatively simple.
WeightIt instead of manual calculation
WeightIt is an excellent package written by Noah Greifer. Where MatchIt brings many matching capabilities under one package, WeightIt serves as a wrapper for the myriad weighting algorithms out there, giving them a unified syntax.
Summary of weights
- Weight ranges:
Min Max
treated 1.0000 || 1.0000
control 0.0071 |---------------------------| 6.7093
- Units with the 5 most extreme weights by group:
47 43 25 20 11
treated 1 1 1 1 1
1594 1084 4342 4408 3074
control 3.3045 3.6843 3.7092 4.2442 6.7093
- Weight statistics:
Coef of Var MAD Entropy # Zeros
treated 0.000 0.000 -0.000 0
control 1.234 0.664 0.422 0
- Effective Sample Sizes:
Control Treated
Unweighted 3778. 864
Weighted 1497.73 864
This summary gives you a sense of the distribution of weights. The lowest weight is 0.0071: cases with very low weights are so unlike the smokers that they tell us little about the counterfactual world in which they smoke.
The effective sample size gives a sense of how much useful counterfactual information we have about the control cases. It’s calculated as:
\[\text{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}\]
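As a sketch, this formula is easy to compute by hand: with equal weights it returns the actual n, and it shrinks as the weights become more variable.

```r
# Effective sample size (Kish's formula)
ess <- function(w) sum(w)^2 / sum(w^2)

ess(rep(1, 100))                # equal weights: ESS = 100
ess(c(rep(1, 50), rep(5, 50)))  # unequal weights: ESS < 100
```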
\[\text{bias}=\frac{\bar{x}_T - \bar{x}_C}{\sqrt{\frac{s^2_T + s^2_C}{2}}}\]
\[\text{bias}=\frac{\bar{x}_T - \bar{x}_C}{s_T}\]
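The two denominators correspond to different `s.d.denom` choices in cobalt. A minimal helper for the second version (treated-group SD, appropriate for the ATT), written as a sketch:

```r
# Standardized bias with the treated-group SD in the denominator (ATT version)
std_bias <- function(x, tr) {
  (mean(x[tr == 1]) - mean(x[tr == 0])) / sd(x[tr == 1])
}

std_bias(x = c(1, 2, 3, 4), tr = c(1, 1, 0, 0))  # (1.5 - 3.5) / sd(c(1, 2))
```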
cobalt
cobalt is an excellent package written by Noah Greifer.
bal.tab (balance tables)
bal.plot (to compare single covariate distributions)
love.plot (graphical balance checking; same info as bal.tab in graph form)
love.plot options
I don’t want to type the same thing a million times, so I define a custom function.
library(cobalt)
love_plot <- function(x) {
love.plot(x,
binary = "std" , # use same formula for binary vars
continuous = "std" , # standardize cont. variables
abs = TRUE , # absolute value
stats = c("m", "ks") , # std. bias and Kolmogorov-Smirnov
s.d.denom = "treat", # use for ATT
line = TRUE , # connect with lines
var.order = "adj" , # sort by adjusted order
thresholds = c(.10, .05)) # rules of thumb
}

Looks pretty good. Note: if we hadn’t had I(medu^2) in the model, we would not have had acceptable balance.
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.00508 0.0208 -0.244 8.07e- 1
2 mbsmoke -0.382 0.0295 -13.0 8.39e-38
The ATT estimate is thus -.382.
For a publication version, we can and should do bootstrapping of the whole workflow. I am not going to demonstrate that today.